Brazilian patent BR112016022329B1: defect processing method, related apparatus, and computer

Abstract:
Embodiments of the present invention provide a method for defect processing, a related apparatus, and a computer. When it is determined that a computer has crashed, a baseboard management controller (12) on the computer may send a read request message to a processor (11) on the computer, where the read request message is used to request reading of first error data recorded by the processor (11), receive a read response message returned by the processor (11), and obtain, according to the read response message, the first error data recorded by the processor (11). By means of the embodiments of the present invention, an operating system need not be used, the acquisition of error data in a computer after the computer has crashed is implemented using a baseboard management controller (12), and the prior-art problem that error data on a computer cannot be acquired after an uncorrectable error on the computer causes a system crash is resolved.
Publication number: BR112016022329B1
Application number: R112016022329
Filing date: 2014-06-24
Publication date: 2019-01-02
Inventor: Song Gang
Applicant: Huawei Tech Co Ltd
Main IPC class:

Description:

“METHOD FOR DEFECT PROCESSING, RELATED APPARATUS, AND COMPUTER”
TECHNICAL FIELD
[0001] The embodiments of the present invention relate to computer technologies and, in particular, to a method for defect processing, a related apparatus, and a computer.
BACKGROUND
[0002] With the large-scale development of information technologies, computers are widely applied in various fields. Defects in a computer generally include software defects, hardware defects, operation (configuration) defects, and other defects. A hardware defect is typically difficult to reproduce, is diagnosed mainly on the basis of personnel experience, is difficult to locate when an error occurs, and often requires components to be removed, reinserted, or replaced multiple times. Therefore, a hardware defect, for example, a defect that occurs in memory, in a processor, or in an input-output (IO) device, is usually the most difficult to process.
[0003] In general, a hardware defect causes an uncorrectable error (Uncorrectable Error) in a computer. An uncorrectable error can not only interrupt a computer service and shorten the computer's uptime, but can even cause a crash. In the prior art, a defect in a computer is processed mainly using the following method: when an uncorrectable error occurs in a system, a processor records error data and sends a notification to an operating system (Operating System, OS). After receiving the notification, the OS captures the error data recorded by the processor and prints the error data, so that a user can analyze and locate the defect and recover from it.
[0004] In the prior art, an OS is required to capture error data. However, once a serious uncorrectable error occurs on a computer and causes the computer to crash (in the present invention, a computer crash means that a blank screen occurs on the computer, no input is accepted from an input device such as a computer mouse or keyboard, and a processor of the computer cannot execute any computer instruction), the OS can no longer function and cannot capture the error data on the computer, which makes it difficult to analyze and process the defect and to recover from it.
SUMMARY
[0005] The embodiments of the present invention propose a method for defect processing, a related apparatus, and a computer, so that error data on a computer can be acquired after a serious uncorrectable error occurs on the computer and causes the computer to crash.
[0006] According to a first aspect, an embodiment of the present invention proposes a computer, including a processor and a baseboard management controller, where:
the baseboard management controller is configured to: when it is determined that the computer has crashed, send a read request message to the processor, where the read request message is used to request reading of first error data recorded by the processor;
the processor is configured to receive the read request message, and send a read response message to the baseboard management controller; and
the baseboard management controller is configured to receive the read response message returned by the processor, and obtain, according to the read response message, the first error data recorded by the processor.
[0007] With reference to the first aspect, in a first possible implementation manner, the processor is additionally configured to acquire the first error data and to record the first error data; and that the baseboard management controller is configured to determine that the computer has crashed specifically means: the baseboard management controller is configured to receive an indication of the occurrence of a serious defect sent by the processor, where the indication of the occurrence of a serious defect is sent by the processor when the processor acquires the first error data and the first error data is of a serious uncorrectable error type; and if at least a portion of the first error data sent by the processor is not received within a predefined waiting time that starts from the time the indication of the occurrence of a serious defect is received, the baseboard management controller is configured to determine that the computer has crashed.
[0008] With reference to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner, that the baseboard management controller is configured to obtain, according to the read response message, the first error data recorded by the processor specifically means: when the read response message carries the first error data, the baseboard management controller is configured to obtain, from the read response message, the first error data recorded by the processor.
[0009] With reference to the first aspect or the first possible implementation manner of the first aspect, in a third possible implementation manner, that the baseboard management controller is configured to obtain, according to the read response message, the first error data recorded by the processor specifically means: when the read response message carries a read failure indication, the baseboard management controller is configured to instruct a warm restart module or a user of the computer to perform a warm restart on the computer, where the read failure indication is used to indicate that the first error data was not read from the processor, so that the processor executes, during the warm restart of the computer, a defect collection instruction of a basic input-output system of the computer, acquires the first error data according to the defect collection instruction of the basic input-output system, and sends the first error data to the baseboard management controller; and the baseboard management controller is configured to receive the first error data sent by the processor.
[0010] With reference to the first aspect or any one of the first to third possible implementation manners of the first aspect, in a fourth possible implementation manner, the baseboard management controller is additionally configured to parse the first error data according to a defect parsing mechanism, to obtain defect parsing information of the first error data.
[0011] With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner, the baseboard management controller is additionally configured to analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion.
[0012] With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner, before it is determined that the computer has crashed, the baseboard management controller is additionally configured to receive second error data sent by the processor, and parse the second error data according to the defect parsing mechanism, to obtain defect parsing information of the second error data, where the second error data is error data generated within a predefined time before the computer generates the first error data; and that the baseboard management controller is configured to analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion includes: the baseboard management controller is configured to analyze the defect parsing information of the second error data and the defect parsing information of the first error data according to the predefined defect processing mechanism, to obtain the defect processing suggestion.
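The selection of "second error data" described above can be sketched in Python as a simple time-window filter. This is only an illustration under stated assumptions: the `ErrorRecord` type, the `select_second_error_data` function, and the use of numeric timestamps are hypothetical, and the source does not say how the predefined time is chosen or how the two sets of parsing information are combined into a suggestion.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ErrorRecord:
    """Hypothetical error record; the source does not define a record format."""
    timestamp: float  # seconds on an arbitrary monotonic scale (assumption)
    description: str


def select_second_error_data(
    history: List[ErrorRecord],
    first_error_time: float,
    window_seconds: float,
) -> List[ErrorRecord]:
    """Return records generated within window_seconds before the first error data."""
    earliest = first_error_time - window_seconds
    return [r for r in history if earliest <= r.timestamp < first_error_time]
```

A caller would pass the selected records, together with the first error data, to whatever analysis step produces the defect processing suggestion.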
[0013] According to a second aspect, an embodiment of the present invention proposes a method for defect processing, applied to a computer including a baseboard management controller and a processor, where the method includes:
when it is determined that the computer has crashed, sending, by the baseboard management controller, a read request message to the processor, where the read request message is used to request reading of first error data recorded by the processor; and
receiving, by the baseboard management controller, a read response message returned by the processor, and obtaining, according to the read response message, the first error data recorded by the processor.
[0014] With reference to the second aspect, in a first possible implementation manner, the method additionally includes: receiving, by the baseboard management controller, an indication of the occurrence of a serious defect sent by the processor, where the indication of the occurrence of a serious defect is sent by the processor when the processor acquires the first error data and the first error data is of a serious uncorrectable error type; and if at least a portion of the first error data sent by the processor is not received within a predefined waiting time that starts from the time the indication of the occurrence of a serious defect is received, determining that the computer has crashed.
[0015] With reference to the second aspect or the first possible implementation manner of the second aspect, in a second possible implementation manner, the receiving, by the baseboard management controller, of a read response message returned by the processor, and the obtaining, according to the read response message, of the first error data recorded by the processor include: when the read response message carries the first error data, obtaining, by the baseboard management controller from the read response message, the first error data recorded by the processor.
[0016] With reference to the second aspect or the first possible implementation manner of the second aspect, in a third possible implementation manner, the receiving, by the baseboard management controller, of a read response message returned by the processor, and the obtaining, according to the read response message, of the first error data recorded by the processor include: when the read response message carries a read failure indication, instructing, by the baseboard management controller, a warm restart module or a user of the computer to perform a warm restart on the computer, so that the processor executes, during the warm restart of the computer, a defect collection instruction of a basic input-output system of the computer, acquires the first error data according to the defect collection instruction of the basic input-output system, and sends the first error data to the baseboard management controller, where the read failure indication is used to indicate that the first error data was not read from the processor; and receiving, by the baseboard management controller, the first error data sent by the processor.
[0017] With reference to the second aspect or any one of the first to third possible implementation manners of the second aspect, in a fourth possible implementation manner, after the obtaining, by the baseboard management controller according to the read response message, of the first error data recorded by the processor, the method additionally includes: parsing, by the baseboard management controller, the first error data according to a defect parsing mechanism, to obtain defect parsing information of the first error data.
[0018] With reference to the fourth possible implementation manner of the second aspect, in a fifth possible implementation manner, the method additionally includes: analyzing, by the baseboard management controller, the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion.
[0019] With reference to the fifth possible implementation manner of the second aspect, in a sixth possible implementation manner, before it is determined by the baseboard management controller that the computer has crashed, the method additionally includes: receiving, by the baseboard management controller, second error data sent by the processor, where the second error data is error data generated within a predefined time before the computer generates the first error data; and the analyzing, by the baseboard management controller, of the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion includes:
parsing, by the baseboard management controller, the second error data according to the defect parsing mechanism, to obtain defect parsing information of the second error data, and analyzing the defect parsing information of the second error data and the defect parsing information of the first error data according to the predefined defect processing mechanism, to obtain the defect processing suggestion.
[0020] According to a third aspect, an embodiment of the present invention proposes a baseboard management controller, including:
a sending unit, configured to: when it is determined that the computer has crashed, send a read request message to the processor, where the read request message is used to request reading of first error data recorded by the processor; and
a receiving unit, configured to receive a read response message returned by the processor, and obtain, according to the read response message, the first error data recorded by the processor.
[0021] With reference to the third aspect, in a first possible implementation manner, the baseboard management controller additionally includes: a determination unit, configured to receive an indication of the occurrence of a serious defect sent by the processor, where the indication of the occurrence of a serious defect is sent by the processor when the processor acquires the first error data and the first error data is of a serious uncorrectable error type; and if at least a portion of the first error data sent by the processor is not received within a predefined waiting time that starts from the time the indication of the occurrence of a serious defect is received, determine that the computer has crashed.
[0022] With reference to the third aspect or the first possible implementation manner of the third aspect, in a second possible implementation manner, that the receiving unit receives a read response message returned by the processor, and obtains, according to the read response message, the first error data recorded by the processor includes: when the read response message carries the first error data, the receiving unit obtains, from the read response message, the first error data recorded by the processor.
[0023] With reference to the third aspect or the first possible implementation manner of the third aspect, in a third possible implementation manner, that the receiving unit receives a read response message returned by the processor, and obtains, according to the read response message, the first error data recorded by the processor includes:
when the read response message carries a read failure indication, the receiving unit instructs a warm restart unit or a user of the computer to perform a warm restart on the computer, so that the processor executes, during the warm restart of the computer, a defect collection instruction of a basic input-output system of the computer, acquires the first error data according to the defect collection instruction of the basic input-output system, and sends the first error data to the receiving unit, where the read failure indication is used to indicate that the first error data was not read from the processor; and the receiving unit receives the first error data sent by the processor.
[0024] With reference to the third aspect or any one of the first to third possible implementation manners of the third aspect, in a fourth possible implementation manner, the baseboard management controller additionally includes: a defect processing unit, configured to parse the first error data according to a defect parsing mechanism, to obtain defect parsing information of the first error data.
[0025] With reference to the fourth possible implementation manner of the third aspect, in a fifth possible implementation manner, the defect processing unit is additionally configured to analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion.
[0026] With reference to the fifth possible implementation manner of the third aspect, in a sixth possible implementation manner, the receiving unit is additionally configured to receive second error data sent by the processor; the defect processing unit is additionally configured to parse the second error data according to the defect parsing mechanism, to obtain defect parsing information of the second error data, where the second error data is error data generated within a predefined time before the computer generates the first error data; and that the defect processing unit is configured to analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion includes: the defect processing unit analyzes the defect parsing information of the second error data and the defect parsing information of the first error data according to the predefined defect processing mechanism, to obtain the defect processing suggestion.
[0027] According to a fourth aspect, an embodiment of the present invention proposes a baseboard management controller, where the baseboard management controller includes a processor, a memory, a bus, and a communications interface, where the memory is configured to store a computer-executable instruction, the processor is connected to the memory using the bus, and when the baseboard management controller runs, the processor executes the computer-executable instruction stored in the memory, so that the baseboard management controller performs the method for defect processing according to the second aspect, or the method for defect processing according to any one of the possible implementation manners of the second aspect.
[0028] According to a fifth aspect, an embodiment of the present invention proposes a computer-readable medium, including a computer-executable instruction, so that when a processor of a computer executes the computer-executable instruction, the computer performs the method for defect processing according to the second aspect, or the method for defect processing according to any one of the possible implementation manners of the second aspect.
[0029] In the embodiments of the present invention, when it is determined that a computer has crashed, a baseboard management controller in the computer can send a read request message to a processor on the computer, where the read request message is used to request reading of first error data recorded by the processor, receive a read response message returned by the processor, and obtain, according to the read response message, the first error data recorded by the processor. In this way, an operating system does not need to be used, only a baseboard management controller is needed to acquire error data on a computer after the computer has crashed, and the prior-art problem that error data on a computer cannot be acquired after a serious uncorrectable error on the computer causes a system crash is resolved.
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. The accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art can still derive other drawings from these accompanying drawings without creative efforts.
[0031] Figure 1 is a schematic diagram of a computer according to an embodiment of the present invention;
[0032] Figure 2 is a schematic diagram of another computer according to an embodiment of the present invention;
[0033] Figure 3 is a method flowchart of a method for defect processing according to an embodiment of the present invention;
[0034] Figure 4 is a method flowchart of another method for defect processing according to an embodiment of the present invention;
[0035] Figure 5 is a schematic diagram of a baseboard management controller according to an embodiment of the present invention; and
[0036] Figure 6 is a schematic structural diagram of the composition of another baseboard management controller according to an embodiment of the present invention.
DESCRIPTION OF EMBODIMENTS
[0037] The embodiments of the present invention propose a method for defect processing, a related apparatus, and a computer, so that error data on a computer can be acquired after a serious uncorrectable error occurs on the computer and causes a computer crash.
[0038] It should be noted that, in the specification, claims, and accompanying drawings of the present invention, the terms "first" and "second" are intended to distinguish between similar objects, but do not necessarily indicate a specific order or sequence. It should be understood that terms used in this way are interchangeable in appropriate cases. In the specification, claims, and accompanying drawings of the present invention, a computer crash means that a blank screen occurs on the computer, a processor of the computer cannot execute any computer instruction, and no input is accepted from an input device such as a computer mouse or keyboard.
Embodiment 1
[0039] Figure 1 is a schematic diagram of a computer according to this embodiment of the present invention. The computer includes a processor 11 and a baseboard management controller 12 (Baseboard Management Controller, BMC).
[0040] The baseboard management controller 12 is configured to: when it is determined that the computer has crashed, send a read request message to processor 11, where the read request message is used to request reading of first error data recorded by processor 11. The first error data is error data generated on the computer, and can be all of the error data generated on the computer or only a part of it. For example, the first error data can be the error data generated within the 2 seconds before the computer crashes; this embodiment of the present invention imposes no limitation in this regard.
[0041] Processor 11 is configured to receive the read request message, and send a read response message to the baseboard management controller 12. At that time, although the computer has crashed and the processor cannot execute any computer instruction, the processor can still receive and respond to the read request message.
[0042] The baseboard management controller 12 is configured to receive the read response message returned by processor 11, and obtain, according to the read response message, the first error data recorded by processor 11.
[0043] For example, processor 11 can write the first error data to a register of processor 11. The baseboard management controller 12 can send a read request message to processor 11 using an address of the register, to acquire the first error data from the register. Although the computer has crashed and cannot execute a computer instruction, the register of processor 11 can respond to the read request message and return a read response message, for example, return the first error data, so that the baseboard management controller 12 can obtain the first error data according to the read response message. It should be noted that, in this embodiment of the present invention, the first error data may include one or more pieces of error data; this embodiment imposes no limitation in this regard.
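The register read exchange described in [0043] can be sketched in Python as follows. This is a minimal model under stated assumptions, not the patent's implementation: `ReadResponse`, `ProcessorRegisters`, `bmc_acquire_first_error_data`, and the register addresses used with them are hypothetical names introduced for illustration, and the source does not specify the message format or the bus that carries it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class ReadResponse:
    """Read response message: carries either error data or a read failure indication."""
    data: Optional[int] = None
    read_failed: bool = False


class ProcessorRegisters:
    """Hypothetical stand-in for the error registers of processor 11.

    It models the point made in [0043]: the registers can answer a read by
    address even when the processor can no longer execute instructions.
    """

    def __init__(self) -> None:
        self._registers: Dict[int, int] = {}

    def record_error(self, address: int, value: int) -> None:
        """Processor side: write error data to a register."""
        self._registers[address] = value

    def handle_read_request(self, address: int) -> ReadResponse:
        """Answer a read request for one register address."""
        if address in self._registers:
            return ReadResponse(data=self._registers[address])
        return ReadResponse(read_failed=True)


def bmc_acquire_first_error_data(
    registers: ProcessorRegisters, addresses: Iterable[int]
) -> Optional[Dict[int, int]]:
    """BMC side: send one read request per address and collect the returned data.

    Returns None if any response carries a read failure indication, so the
    caller can fall back to another acquisition path.
    """
    collected: Dict[int, int] = {}
    for address in addresses:
        response = registers.handle_read_request(address)
        if response.read_failed:
            return None
        collected[address] = response.data
    return collected
```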
[0044] In this embodiment of the present invention, when it is determined that the computer has crashed, the baseboard management controller 12 can send a read request message to processor 11, where the read request message is used to request reading of first error data recorded by processor 11, receive a read response message returned by processor 11, and obtain, according to the read response message, the first error data recorded by processor 11. In this embodiment of the present invention, an operating system does not need to be used, only a baseboard management controller is required to acquire error data on a computer after the computer has crashed, and the prior-art problem that error data on a computer cannot be acquired after a serious uncorrectable error on the computer causes a system crash is resolved.
[0045] This embodiment of the present invention is presented in detail below.
[0046] (1) About determining a computer crash
[0047] In general, an uncorrectable error (Uncorrectable Error) caused by a defect in a computer can be categorized as a catastrophic error (Catastrophic Error), a fatal error (Fatal Error), or a recoverable error (Recoverable Error). The catastrophic error and the fatal error are the most serious, and can cause a blue screen, a purple screen, or even a crash (for example, a blank screen and a hang) on the computer. Therefore, a catastrophic error or a fatal error on the computer can be monitored. For example, an internal error (Internal Error, IERR, which is a catastrophic error) or a machine check error (Machine Check Error, MCERR, which is a fatal error) is monitored. When a catastrophic error or a fatal error occurs on the computer, if the computer cannot execute an instruction of a basic input-output system (Basic Input-Output System, BIOS) or an instruction of an operating system (Operating System, OS), it can be determined that the computer has crashed.
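The classification in [0047] can be written down as a small Python sketch. The names `Severity`, `SIGNAL_SEVERITY`, `is_serious`, and `is_crashed` are hypothetical. The sketch also assumes one particular reading of the crash condition, namely that neither a BIOS instruction nor an OS instruction can be executed; the source wording leaves this open.

```python
from enum import Enum


class Severity(Enum):
    """The three categories of uncorrectable error named in [0047]."""
    CATASTROPHIC = "catastrophic"
    FATAL = "fatal"
    RECOVERABLE = "recoverable"


# Mapping stated in [0047]: IERR is a catastrophic error, MCERR is a fatal error.
SIGNAL_SEVERITY = {
    "IERR": Severity.CATASTROPHIC,
    "MCERR": Severity.FATAL,
}


def is_serious(severity: Severity) -> bool:
    """Catastrophic and fatal errors are the ones worth monitoring for a crash."""
    return severity in (Severity.CATASTROPHIC, Severity.FATAL)


def is_crashed(severity: Severity, bios_can_execute: bool, os_can_execute: bool) -> bool:
    """Crash rule under the assumed reading: serious error, and neither BIOS nor OS code runs."""
    return is_serious(severity) and not bios_can_execute and not os_can_execute
```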
[0048] Specifically, processor 11 can be additionally configured to acquire the first error data and to record the first error data. For example, processor 11 can generate or receive the first error data, and write the first error data to a cache of the computer, to a register of processor 11, or to another module that has a storage capacity. In one aspect, after processor 11 acquires the first error data, if the computer has not crashed, processor 11 can send the first error data to the baseboard management controller, for example, by means of an error collection instruction of the basic input-output system that is configured on the computer in advance. If the computer has not crashed, processor 11 executes the error collection instruction of the basic input-output system, and sends the first error data to the baseboard management controller 12 according to the error collection instruction of the basic input-output system. If the computer has crashed, processor 11 cannot execute any computer instruction. In another aspect, after processor 11 acquires the first error data, if the first error data is of a serious uncorrectable error type, processor 11 can additionally send an indication of the occurrence of a serious defect, to notify the baseboard management controller 12 that a catastrophic error or a fatal error is occurring on the computer and can cause a crash. The first error data being of a serious uncorrectable error type means that the first error data belongs to a catastrophic error or a fatal error. Therefore, the baseboard management controller 12 can be configured to receive the indication of the occurrence of a serious defect sent by processor 11. If at least a portion of the first error data sent by processor 11 is not received within a predefined waiting time that starts from the time the indication of the occurrence of a serious defect is received, the baseboard management controller 12 can determine that the computer has crashed.
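The waiting-time rule at the end of [0048] can be sketched in Python as follows. This is an illustration only: the class name `CrashDetector`, its method names, and the injectable `clock` parameter are hypothetical, and the waiting time is left to the caller because the source does not give a value.

```python
import time
from typing import Callable, Optional


class CrashDetector:
    """Hypothetical BMC-side helper that applies the waiting-time rule of [0048]."""

    def __init__(self, wait_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.wait_seconds = wait_seconds
        self._clock = clock  # injectable so the rule can be exercised without sleeping
        self._indication_at: Optional[float] = None
        self._data_received = False

    def on_serious_defect_indication(self) -> None:
        """Start the waiting time when the serious-defect indication arrives."""
        self._indication_at = self._clock()
        self._data_received = False

    def on_error_data(self, chunk: bytes) -> None:
        """Note that at least a portion of the first error data arrived in time."""
        if self._indication_at is None or not chunk:
            return
        if self._clock() - self._indication_at < self.wait_seconds:
            self._data_received = True

    def computer_crashed(self) -> bool:
        """True once the waiting time has elapsed with no error data received."""
        if self._indication_at is None or self._data_received:
            return False
        return self._clock() - self._indication_at >= self.wait_seconds
```

Passing a fake clock, for example `CrashDetector(5.0, clock=lambda: now[0])` with a mutable `now` list, lets the rule be checked deterministically.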
[0049] In addition, the baseboard management controller 12 can alternatively determine, according to a notification from a user, that the computer has crashed. For example, when the user finds that the computer has crashed, the user can notify the baseboard management controller 12, and the baseboard management controller 12 can determine, according to the user's notification, that the computer has crashed, in order to start acquiring the first error data.
[0050] (2) About acquiring the first error data
[0051] When receiving the read request message, processor 11 can add the first error data to the read response message according to the read request message and return the read response message to the baseboard management controller 12. In that case, the baseboard management controller 12 reads the data successfully, and can obtain, from the read response message, the first error data recorded by processor 11.
[0052] However, when some hardware defects cause an uncorrectable error that results in a computer crash, the baseboard management controller 12 may fail to read the first error data, and the read response message then carries a read failure indication, where the read failure indication is used to indicate that the first error data was not read from processor 11. The baseboard management controller 12 can be configured to instruct a warm restart module or a user of the computer to perform a warm restart on the computer, so that processor 11 executes, during the warm restart of the computer, a defect collection instruction of a basic input-output system of the computer, acquires the first error data according to the defect collection instruction of the basic input-output system, and sends the first error data to the baseboard management controller 12. The baseboard management controller 12 can receive the first error data sent by processor 11, to complete the acquisition of the first error data.
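The two acquisition paths in [0051] and [0052] can be combined into one control flow, sketched here in Python. The function `acquire_with_fallback` and its three callback parameters are hypothetical; the source does not describe how the warm restart is triggered or how the BIOS delivers the data, so those steps are left as caller-supplied callables.

```python
from typing import Callable, Optional


def acquire_with_fallback(
    read_from_processor: Callable[[], Optional[bytes]],
    request_warm_restart: Callable[[], None],
    receive_from_bios: Callable[[], bytes],
) -> bytes:
    """Try the direct read first; on a read failure, warm-restart and wait for the BIOS.

    read_from_processor returns None to model a read failure indication.
    """
    data = read_from_processor()
    if data is not None:
        # Direct path ([0051]): the read response message carried the error data.
        return data
    # Fallback path ([0052]): instruct a warm restart so that the BIOS defect
    # collection instruction gathers the data and sends it to the BMC.
    request_warm_restart()
    return receive_from_bios()
```

A warm restart is used in the fallback because, as the description explains, processor register contents survive it.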
[0053] It should be noted that a restart of a computer can be categorized as a warm restart or a cold restart. During a cold restart, the computer is powered off and then powered on again, and information can be lost as a result. For example, after a cold restart, information saved in a processor register is lost. A cold restart is performed on the computer when a reset power switch is pressed. Unlike a cold restart, during a warm restart the computer is not powered off and on again, and information saved in the processor register is not lost. A warm restart refers to clicking Restart in the Start menu to shut down and start the computer according to the normal procedure. In this embodiment and subsequent embodiments of the present invention, a warm restart performed on the computer has the same meaning as above.
[0054] In addition, the baseboard management controller 12 can be additionally configured to: after the first error data is acquired, send a clear-data message to processor 11, to instruct processor 11 to delete the first error data recorded by processor 11, thereby avoiding wasted storage resources.
[0055] Optionally, the baseboard management controller 12 can be additionally configured to: after the serious defect occurrence indication sent by processor 11 is received, send an alarm message to a defect alarm module of the computer or perform a printing operation, in order to notify the user that a serious defect has occurred, so that the user learns of the defect in the computer in time.
[0056] (3) About analyzing, locating, and processing a defect [0057] In the prior art, in general, error data can be printed only in a case in which a computer does not crash, so there is no defect record, and a defect can be analyzed, located, and processed only manually. In this embodiment of the present invention, the baseboard management controller 12 can keep a complete defect record, and furthermore automatically locate a defect source and provide a defect processing suggestion, which helps in processing a defect and recovering from the defect in time. A specific solution is as follows:
[0058] The first error data recorded by processor 11 is generally binary information represented by 0s and 1s. Therefore, the baseboard management controller 12 can be additionally configured to parse the first error data according to a defect parsing mechanism, to obtain defect parsing information of the first error data. The defect parsing information of the first error data can include: the time at which each piece of error data in the first error data is generated, who collects the error data, which processor and which core (Core) the error data comes from, which error type the error data belongs to, and the like. For example, in the case of an X86 computer, the baseboard management controller 12 can parse, according to Intel's defect code definitions, the first error data in binary form, to obtain the defect parsing information. The defect parsing information can not only be provided to a maintenance team or to a user to understand the defect, but can also be used for subsequent location, analysis, and processing of the defect.
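As an illustration of parsing binary error data, the following minimal sketch decodes a 64-bit value laid out like an Intel IA32_MCi_STATUS machine-check status register. The bit positions follow Intel's published register layout; the class name, the function name, and the choice of fields to expose are assumptions made for this example and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class ParsedMcaStatus:
    """Hypothetical container for decoded status fields (illustrative only)."""
    valid: bool                # VAL, bit 63: the register holds a logged error
    overflow: bool             # OVER, bit 62: an earlier error was overwritten
    uncorrected: bool          # UC, bit 61: the error was not corrected
    enabled: bool              # EN, bit 60: error reporting was enabled
    misc_valid: bool           # MISCV, bit 59: the MISC register is valid
    addr_valid: bool           # ADDRV, bit 58: the ADDR register is valid
    context_corrupt: bool      # PCC, bit 57: processor context may be corrupt
    mca_error_code: int        # bits 15:0
    model_specific_code: int   # bits 31:16


def parse_mci_status(raw: int) -> ParsedMcaStatus:
    """Decode a raw 64-bit status value into named fields."""
    def bit(n: int) -> bool:
        return bool((raw >> n) & 1)

    return ParsedMcaStatus(
        valid=bit(63),
        overflow=bit(62),
        uncorrected=bit(61),
        enabled=bit(60),
        misc_valid=bit(59),
        addr_valid=bit(58),
        context_corrupt=bit(57),
        mca_error_code=raw & 0xFFFF,
        model_specific_code=(raw >> 16) & 0xFFFF,
    )
```

A real parser would additionally map the error code fields to an error type and attach the processor, core, and timestamp, which this sketch leaves out.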
[0059] The baseboard management controller 12 can be additionally configured to analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion. The predefined defect processing mechanism can be a defect mechanism or defect processing experience for X86, and the defect processing suggestion obtained can include defect location information and/or processing suggestion information, so that the user or a defect rectification team can perform processing on the computer according to the defect processing suggestion to recover the computer. In addition, the first error data may be only error data generated within a very short period of time before the computer crashes. For example, the first error data is error data generated within 0.5 second before the computer crashes. To improve the accuracy of locating and analyzing a defect, defect parsing information of more error data can be analyzed. Specifically, before it is determined that the computer has crashed, the baseboard management controller 12 can additionally receive second error data sent by processor 11, where the second error data is different from the first error data, and the second error data is error data generated within a predefined time before the computer generates the first error data. The baseboard management controller 12 can parse the second error data according to the defect parsing mechanism, to obtain defect parsing information of the second error data, and analyze the defect parsing information of the second error data and the defect parsing information of the first error data according to the predefined defect processing mechanism, to obtain the defect processing suggestion. For example, the first error data can be error data generated within 0.5 second before the computer crashes, and when the predefined time is 4.5 seconds, the second error data can be error data generated from 5 seconds to 0.5 second before the computer crashes; in that case, the baseboard management controller 12 can analyze, according to the predefined defect processing mechanism, the defect parsing information of the error data generated within 5 seconds before the computer crashes, to obtain the defect processing suggestion.
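The time windows in the example above can be sketched as follows. The 0.5 second and 4.5 second values come from the example in the text; the function name and the representation of error data as bare timestamps (in seconds) are assumptions made purely for illustration.

```python
from typing import List, Tuple

FIRST_WINDOW_S = 0.5      # first error data: within 0.5 s before the crash
PREDEFINED_TIME_S = 4.5   # second error data: the 4.5 s before that window


def split_error_data(
    timestamps: List[float],
    crash_time: float,
    first_window: float = FIRST_WINDOW_S,
    predefined_time: float = PREDEFINED_TIME_S,
) -> Tuple[List[float], List[float]]:
    """Split error timestamps into (first, second) error data windows."""
    first: List[float] = []
    second: List[float] = []
    for t in timestamps:
        age = crash_time - t  # how long before the crash the error occurred
        if 0 <= age <= first_window:
            first.append(t)
        elif first_window < age <= first_window + predefined_time:
            second.append(t)
        # Anything older than 5 s (or after the crash) is ignored here.
    return first, second
```

Analyzing both lists together corresponds to analyzing the error data generated within 5 seconds before the crash.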
[0060] In addition, the baseboard management controller 12 can be additionally configured to print the defect parsing information of the first error data, the defect parsing information of the second error data, or the defect processing suggestion, so that the user or the defect rectification team can process the defect in the computer according to the printed information.
[0061] In addition, the baseboard management controller 12 can additionally save at least one of the defect parsing information of the first error data, the defect parsing information of the second error data, the first error data, and the second error data in a defect information base of the computer, to obtain a defect record of the computer, in order to help in subsequently locating the defect and recovering from the defect. For example, the baseboard management controller 12 can save the defect parsing information of the first error data and the defect parsing information of the second error data to the defect information base, so that the defect information base holds the complete error data and can provide a complete defect record. In this embodiment of the present invention, the defect information base can be located on the baseboard management controller 12, or it can be located outside the baseboard management controller 12.
[0062] It should be noted that in a practical application process, different ways can be used according to different application scenarios to locate, analyze, and process a defect in a computer. For example, for a non-single-node application scenario, a system can include multiple computers in accordance with this embodiment of the present invention. Each such computer may have the ability to locate, analyze, and process a defect. In that case, the baseboard management controller of one of the multiple computers (for example, a primary computer) can collect error data from the baseboard management controllers of the other computers, and the baseboard management controller of that computer performs defect locating, analysis, and processing jointly for all computers in the system. Alternatively, the baseboard management controllers of the multiple computers in the system can report the error data they obtain to a management device (for example, a management server) in the system, and the management device performs defect locating, analysis, and processing jointly for all computers in the system using the method in this embodiment.
[0063] In this embodiment of the present invention, an operating system does not need to be used; only the baseboard management controller 12 is required to acquire error data in a computer after the computer has crashed, and the prior-art problem that error data in a computer cannot be acquired after a serious uncorrectable error that occurs on the computer causes a system crash is resolved. In addition, the baseboard management controller 12 can additionally record a complete defect record in the defect information base, and can additionally parse the first error data, analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, locate a defect source, and provide a processing suggestion.
Embodiment 2 [0064] To better describe the present invention, several specific details are provided in the specific implementations below. A person skilled in the art should understand that the present invention can also be implemented without some of these specific details. In this embodiment of the present invention, the structural composition and functions of processor 11 and the baseboard management controller 12 in Embodiment 1 are presented in detail with reference to Figure 2.
[0065] Figure 2 is a schematic structural diagram of the composition of a computer according to this embodiment of the present invention. The computer includes a processor 11 and a baseboard management controller 12. Processor 11 can include a recording module 21, a storage module 22, and an instruction execution module 23. The recording module 21 can be specifically a Machine Check Architecture (MCA), a hardware defect check architecture responsible for internal functional modules of processor 11, and/or an Advanced Error Reporting (AER) mechanism, a defect reporting mechanism responsible for a PCIe-standard IO device of the computer. Correspondingly, the storage module 22 can be an MCA register and/or an AER register. The MCA register and the AER register can be located inside processor 11. The instruction execution module 23 can be a core of processor 11 and is configured to execute an instruction of a basic input-output system and an instruction of an operating system.
[0066] The recording module 21 can be configured to acquire error data in the computer, for example, to produce error data generated when a defect occurs in internal functional modules of processor 11 or, as another example, to receive error data generated when a defect occurs in the IO device. The error data in the computer includes, but is not limited to, the first error data and the second error data in this embodiment of the present invention. The recording module 21 can record, in the storage module 22, the error data acquired in the computer. Specifically, if the error data in the computer is acquired by the MCA, the MCA can record the error data in the MCA register. If the error data in the computer is acquired by the AER, the AER can record the error data in the AER register, where the range of error data acquired by the MCA or the AER can be set by configuring the corresponding register using the BIOS. Optionally, when or after recording the error data in the corresponding register, the MCA or the AER can additionally save, in a first register, the address of the register that records the error data, so that the instruction execution module 23 can subsequently acquire the error data in the computer according to an error collection instruction of the basic input-output system and using the address recorded in the first register.
[0067] When acquiring the error data in the computer, the recording module 21 can additionally trigger a system management interrupt (System Management Interrupt, SMI). The system management interrupt is used to trigger the instruction execution module 23 to execute the error collection instruction of the basic input-output system, where the error collection instruction of the basic input-output system can be configured in advance in a memory that stores the instructions of the basic input-output system. If the computer does not crash, the instruction execution module 23 can acquire, from the storage module 22, the error data in the computer according to the error collection instruction of the basic input-output system, and send the error data to the baseboard management controller 12. If the computer has crashed, the instruction execution module 23 cannot execute any instruction of the computer.
[0068] According to Embodiment 1, the second error data is error data generated within a predefined time before the computer generates the first error data; therefore, the recording module 21 first acquires the second error data and then acquires the first error data. When acquiring the second error data, the recording module 21 can, on the one hand, record the second error data in the storage module 22 and, on the other hand, trigger the system management interrupt. If the computer does not crash, the instruction execution module 23 can execute the error collection instruction of the basic input-output system in response to the system management interrupt, acquire the second error data from the storage module 22 according to the error collection instruction, and send the second error data to the baseboard management controller 12. Optionally, the instruction execution module 23 can send the second error data to the baseboard management controller 12 using the Intelligent Platform Management Interface (IPMI) standard, and the baseboard management controller 12 can receive, using the IPMI standard, the second error data sent by the instruction execution module 23. It should be noted that when the second error data includes multiple pieces of error data and the recording module 21 acquires the second error data only over multiple acquisitions, the recording module 21 can trigger the system management interrupt each time it acquires a portion of the second error data. Correspondingly, the instruction execution module 23 can execute the error collection instruction of the basic input-output system multiple times, sending a portion of the second error data to the baseboard management controller 12 each time. Optionally, after sending the second error data to the baseboard management controller 12, the instruction execution module 23 can execute a delete instruction of the operating system to delete, according to that instruction, the second error data saved in the recording module 21. In other words, the instruction execution module 23 can delete, from the storage module 22, the error data that has been sent to the baseboard management controller 12, thereby avoiding repeated transmission of error data to the baseboard management controller 12.
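The collect, send, and delete sequence described above can be modelled in a few lines. This is only a software analogy under stated assumptions: the storage module is represented as a Python list, the IPMI transfer as an arbitrary callback, and the name `handle_smi` is hypothetical; none of these correspond to a real firmware interface.

```python
from typing import Callable, List


def handle_smi(storage: List[str], send: Callable[[str], None]) -> int:
    """Send every pending error record, deleting each one after it is sent.

    Deleting a record only after a successful send means that a later
    interrupt does not transmit the same record again.
    Returns the number of records sent during this invocation.
    """
    pending = list(storage)      # snapshot, so the list can be mutated safely
    for record in pending:
        send(record)             # stands in for the transfer to the controller
        storage.remove(record)   # stands in for clearing the register
    return len(pending)
```

Calling the handler once per interrupt mirrors the case in which the second error data arrives in several portions.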
[0069] When the recording module 21 acquires the first error data after acquiring the second error data, the recording module 21 can also trigger the system management interrupt. In addition, if the first error data is of a serious uncorrectable error type, that is, the first error data belongs to a catastrophic error or a fatal error, the recording module 21 can additionally trigger a serious defect occurrence indication, to notify the baseboard management controller 12 that a catastrophic error or fatal error has occurred on the computer and can cause a crash. When the first error data is indeed of the serious uncorrectable type and the computer actually crashes, the instruction execution module 23 cannot execute any instruction of the computer; even though the recording module 21 has triggered the system management interrupt, the instruction execution module 23 still cannot execute the error collection instruction of the basic input-output system, and cannot acquire the first error data from the storage module 22 and send it to the baseboard management controller 12. Therefore, if the baseboard management controller 12 does not receive at least a portion of the first error data sent by processor 11 within a predefined waiting time that starts from the time the serious defect occurrence indication is received, it can be determined that the computer has crashed. Specifically, the recording module 21 can trigger the serious defect occurrence indication by changing the level of a CATERR_N or ERROR_N pin, and the baseboard management controller 12 can receive the serious defect occurrence indication by receiving a level signal from the CATERR_N or ERROR_N pin.
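The timeout-based crash determination can be expressed as a small predicate. The sketch below assumes that the controller has the time at which the indication was received and the arrival times of any error data from the processor; the function name and the use of plain floating-point seconds are illustrative assumptions.

```python
from typing import Iterable


def computer_crashed(
    indication_time: float,
    data_arrival_times: Iterable[float],
    waiting_time: float,
) -> bool:
    """Return True if no error data arrived within the waiting time.

    The waiting time starts when the serious defect occurrence indication
    is received. If at least one portion of the first error data arrives
    inside that window, the processor is still executing instructions and
    the computer is treated as not crashed.
    """
    deadline = indication_time + waiting_time
    return not any(
        indication_time <= t <= deadline for t in data_arrival_times
    )
```

For example, with an indication at 10.0 s and a 2.0 s waiting time, data arriving at 11.0 s means no crash is declared, while data arriving only at 13.5 s (or never) leads to the crash determination.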
[0070] When it is determined that the computer has crashed, the baseboard management controller 12 can send a read request message to the recording module 21, where the read request message is used to request reading of the first error data. After the computer has crashed, the recording module 21 can still receive the read request message and send a read response message to the baseboard management controller 12. Therefore, the baseboard management controller 12 can receive the read response message, and obtain, according to the read response message, the first error data recorded by processor 11. Specifically, the baseboard management controller 12 can traverse the MCA register or the AER register over a Platform Environment Control Interface (PECI) bus, in order to read the first error data from the MCA register or the AER register. If the baseboard management controller 12 successfully reads the data from the MCA register or the AER register, the read response message returned by the MCA register or the AER register carries the first error data, and the baseboard management controller 12 can acquire the first error data. If the baseboard management controller 12 fails to read the data from the MCA register or the AER register, the read response message returned by the MCA register or the AER register carries a read failure indication, for example, garbled characters. In that case, the baseboard management controller 12 can instruct a warm restart module or a computer user to perform a warm restart on the computer, so that the instruction execution module 23 executes, during the warm restart of the computer, a defect collection instruction of the basic input-output system, traverses the MCA register or the AER register according to the defect collection instruction of the basic input-output system, acquires the first error data, and sends the first error data to the baseboard management controller 12 using the IPMI standard, and the baseboard management controller 12 can receive the first error data sent by the instruction execution module 23.
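The two acquisition paths described above (a direct register read, then a warm restart as a fallback) can be sketched as a single function. Both callbacks are hypothetical stand-ins: `read_registers` models the read request over the sideband bus and returns `None` to represent a read failure indication, and `warm_restart_and_collect` models the warm restart followed by collection through the basic input-output system.

```python
from typing import Callable, Optional, Tuple


def acquire_first_error_data(
    read_registers: Callable[[], Optional[bytes]],
    warm_restart_and_collect: Callable[[], bytes],
) -> Tuple[bytes, str]:
    """Return the first error data together with the path that produced it."""
    data = read_registers()
    if data is not None:
        # The read response carried the first error data.
        return data, "direct-read"
    # The read response carried a read failure indication: fall back to a
    # warm restart, which keeps the register contents intact.
    return warm_restart_and_collect(), "warm-restart"
```

The fallback relies on the property stated earlier in the description that a warm restart does not lose information saved in the processor registers.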
[0071] In this embodiment of the present invention, the baseboard management controller 12 cooperates with processor 11 to acquire error data in a computer after the computer has crashed, and the prior-art problem that error data in a computer cannot be acquired after a serious uncorrectable error that occurs on the computer causes a system crash is resolved.
Embodiment 3 [0072] This embodiment of the present invention provides a method for defect processing, used on the computer shown in Figure 1 or Figure 2, the computer including a baseboard management controller and a processor, where the method includes:
[0073] S301: When it is determined that the computer has crashed, the baseboard management controller sends a read request message to the processor, where the read request message is used to request reading of first error data recorded by the processor.
[0074] The processor can acquire the first error data and record the first error data. When the computer is determined to have crashed, the baseboard management controller can send a read request message to the processor, to read the first error data recorded by the processor. At that time, although the computer has crashed and the processor cannot execute any instruction of the computer, the processor can still receive and respond to the read request message, so that the baseboard management controller can acquire the first error data. For example, the processor can record the first error data in a register of the processor, and the baseboard management controller can send the read request message to that register. The register of the processor can receive the read request message and return a read response message. In this embodiment of the present invention, the first error data may include one or more pieces of error data, which is not limited in this embodiment.
[0075] The baseboard management controller can determine in multiple ways that the computer has crashed; specifically, reference can be made to Embodiment 1 or Embodiment 2, and details are not described again in this embodiment of the present invention.
[0076] S302: The baseboard management controller receives a read response message returned by the processor, and obtains, according to the read response message, the first error data recorded by the processor.
[0077] If the baseboard management controller successfully reads the data from the processor, the read response message may carry the first error data, and the baseboard management controller can obtain, from the read response message, the first error data recorded by the processor. If the baseboard management controller fails to read the data from the processor, the read response message may carry a read failure indication, and the baseboard management controller may acquire the first error data in another way. For example, a defect collection instruction of the basic input-output system can be configured in advance on the computer. When the read response message carries a read failure indication, the baseboard management controller can instruct a warm restart module or a computer user to perform a warm restart on the computer, so that the processor executes, during the warm restart of the computer, the defect collection instruction of the basic input-output system, acquires the first error data according to the defect collection instruction of the basic input-output system, and sends the first error data to the baseboard management controller, and the baseboard management controller can complete acquisition of the first error data by receiving the first error data sent by the processor.
[0078] In this embodiment of the present invention, when it is determined that the computer has crashed, the baseboard management controller of the computer can send a read request message to the processor of the computer, where the read request message is used to request reading of first error data recorded by the processor, receive a read response message returned by the processor, and obtain, according to the read response message, the first error data recorded by the processor. Through this embodiment of the present invention, an operating system does not need to be used; only a baseboard management controller is required to acquire error data in a computer after the computer has crashed, and the prior-art problem that error data in a computer cannot be acquired after a serious uncorrectable error that occurs on the computer causes a system crash is resolved.
Embodiment 4 [0079] This embodiment of the present invention provides a method for defect processing, used on the computer shown in Figure 1 or Figure 2, the computer including a baseboard management controller and a processor, where the method includes:
[0080] S401: The baseboard management controller receives a serious defect occurrence indication sent by the processor, where the serious defect occurrence indication is sent by the processor when the processor acquires first error data and the first error data is of a serious uncorrectable error type.
[0081] S402: The baseboard management controller sends an alarm message to a computer defect alarm module or performs a printing operation, to notify a user of the occurrence of a serious defect alarm.
[0082] After receiving the indication of the occurrence of a serious defect sent by the processor, the baseboard management controller can trigger a defect alarm sensor using the alarm message or perform the printing operation, to notify the user that a serious defect occurs in the computer and can cause a crash. In this embodiment of the present invention, S402 is an optional step.
[0083] S403: If the baseboard management controller does not receive at least part of the first error data sent by the processor within a predefined waiting time that starts from the time when the serious defect occurrence indication is received, the baseboard management controller determines that the computer has crashed, and performs step S404.
[0084] After the processor acquires the first error data, if the computer does not crash, the processor can execute an error collection instruction of the basic input-output system, and send the first error data to the baseboard management controller according to the error collection instruction of the basic input-output system. If the computer has crashed, the processor cannot execute any instruction of the computer. Therefore, if the baseboard management controller does not receive at least a portion of the first error data sent by the processor within the predefined waiting time that starts from the time the serious defect occurrence indication is received, it can be determined that the computer has crashed.
[0085] S404: The baseboard management controller sends a read request message to the processor, where the read request message is used to request read of the first error data recorded by the processor.
[0086] After determining that the computer has crashed, the baseboard management controller can acquire the first error data from the processor, to implement acquisition of error data in the computer after the computer has crashed.
[0087] S405: The baseboard management controller receives a read response message returned by the processor, and obtains, according to the read response message, the first error data recorded by the processor.
[0088] The baseboard management controller obtains, according to the read response message, the first error data recorded by the processor; specifically, the way in S405a or the way in S405b can be used.
[0089] S405a: If the read response message carries the first error data, the baseboard management controller obtains, from the read response message, the first error data recorded by the processor.
[0090] If the read response message carries the first error data, this indicates that the baseboard management controller has successfully read the first error data from the processor, and the baseboard management controller can obtain, from the read response message, the first error data recorded by the processor.
[0091] S405b: If the read response message carries a read failure indication, where the read failure indication is used to indicate that the first error data was not read from the processor, the baseboard management controller instructs a warm restart module or a computer user to perform a warm restart on the computer, so that the processor executes, during the warm restart of the computer, a defect collection instruction of a basic input-output system of the computer, acquires the first error data according to the defect collection instruction of the basic input-output system, and sends the first error data to the baseboard management controller; and the baseboard management controller receives the first error data sent by the processor.
[0092] The defect collection instruction of the basic input-output system can be configured in advance on the computer. When the baseboard management controller fails to read the first error data from the processor, the read response message carries a read failure indication, and the baseboard management controller instructs the warm restart module or the computer user to perform a warm restart on the computer, so that the processor executes, during the warm restart of the computer, the defect collection instruction of the basic input-output system of the computer, acquires the first error data according to the defect collection instruction of the basic input-output system, and sends the first error data to the baseboard management controller.
[0093] S406: The baseboard management controller parses the first error data according to a defect parsing mechanism, to obtain defect parsing information from the first error data.
[0094] After acquiring the first error data, the baseboard management controller parses the first error data according to the defect parsing mechanism, to obtain defect parsing information of the first error data, where the defect parsing information of the first error data may include: the time at which each piece of error data in the first error data is generated, who collects the error data, which processor and which core (Core) the error data comes from, which error type the error data belongs to, and the like. The defect parsing information can not only be provided to a maintenance team or the user to understand the defect, but can also be used for subsequent location, analysis, and processing of the defect.
[0095] S407: The baseboard management controller analyzes the defect parsing information from the first error data according to a predefined defect processing mechanism, to obtain the defect processing suggestion.
[0096] The predefined defect processing mechanism can be a defect mechanism or defect processing experience for X86. The baseboard management controller analyzes the defect parsing information of the first error data according to the predefined defect processing mechanism, and obtains the defect processing suggestion, where the defect processing suggestion can be defect location information or processing suggestion information, so that the user or a defect rectification team can perform processing on the computer according to the defect processing suggestion, to recover the computer.
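One simple way to picture a predefined defect processing mechanism is a lookup table that maps a parsed error type to a defect location and a processing suggestion. The sketch below is purely illustrative: the rule table, the `error_type` field name, and the suggestion wording are all invented for this example and are not taken from the patent or from any vendor documentation.

```python
from typing import Dict, List, Tuple

# Hypothetical rules: (error type, defect location, processing suggestion).
RULES: List[Tuple[str, str, str]] = [
    ("memory", "memory module", "reseat or replace the memory module"),
    ("pcie", "PCIe IO device", "check or replace the IO device"),
    ("processor", "processor core", "replace the processor"),
]


def suggest(parsed_records: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Map parsed error records to (location, suggestion) pairs.

    Records whose error type matches no rule are skipped, and duplicate
    suggestions are reported only once, in first-seen order.
    """
    suggestions: List[Tuple[str, str]] = []
    for record in parsed_records:
        for error_type, location, action in RULES:
            if record.get("error_type") == error_type:
                entry = (location, action)
                if entry not in suggestions:
                    suggestions.append(entry)
    return suggestions
```

Passing the parsed records of both the first and the second error data to such a function would correspond to analyzing them jointly.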
[0097] S408: The baseboard management controller prints the defect processing suggestion.
[0098] After obtaining the defect processing suggestion, the baseboard management controller can print the defect processing suggestion, or can additionally print the defect processing suggestion and the defect parsing information from the first data error, so that the user or the fault rectification team can perform processing on the computer according to the printed information, to recover the computer.
[0099] In this embodiment of the present invention, an operating system does not need to be used; only a baseboard management controller is required to acquire error data in a computer after the computer has crashed, and the prior-art problem that error data in a computer cannot be acquired after a serious uncorrectable error that occurs on the computer causes a system crash is resolved. In addition, the baseboard management controller can additionally parse the first error data, and analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to locate a defect source and provide a processing suggestion.
[00100] Since in step S407 only the defect parsing information from the first error data is analyzed to obtain the defect processing suggestion, the first error data can only be error data generated within a period very short time before the computer crashes. For example, the first error data is error data generated up to 2 seconds before the computer crashes and, therefore, to improve the accuracy of locating and analyzing a defect, defect parsing information for more error data can be analyzed.
[00101] Before step S403, the baseboard management controller can additionally receive second error data sent by the processor, where the second error data is error data generated within a predefined time before the computer generates the first error data.
[00102] Step S407 can alternatively be: the baseboard management controller parses the second error data according to the defect parsing mechanism, to obtain defect parsing information of the second error data, and analyzes the defect parsing information of the second error data and the defect parsing information of the first error data to obtain the defect processing suggestion.
[00103] In this embodiment of the present invention, the baseboard management controller can analyze the defect parsing information of the second error data and the defect parsing information of the first error data to obtain the defect processing suggestion, in order to improve the accuracy of locating and analyzing a defect.
[00104] Optionally, after step S405, the baseboard management controller can additionally save at least one of the defect parsing information of the first error data, the defect parsing information of the second error data, the first error data, and the second error data in a defect information base of the computer. For example, the defect parsing information of the first error data and the defect parsing information of the second error data are saved in the defect information base, or the first error data and the second error data are saved in the defect information base, in order to keep a complete defect record in the defect information base.
[00105] Optionally, after step S405, the baseboard management controller can additionally send a clear data message to the processor, to instruct the processor to delete the first error data recorded by the processor, thereby avoiding waste of a storage resource.
[00106] For details of the baseboard management controller in Embodiment 3 or Embodiment 4 of the present invention, reference can be made to the interaction between the baseboard management controller and the processor, and to the defect processing of the baseboard management controller, in Embodiment 1 or Embodiment 2 of the present invention.
Embodiment 5
[00107] This embodiment of the present invention provides a baseboard management controller, used on a computer including the baseboard management controller and a processor, for example, used on the computer shown in Figure 1 or 2. As shown in Figure 5, the baseboard management controller can include a sending unit and a receiving unit.
[00108] The sending unit is configured to: when it is determined that the computer is down, send a read request message to the processor, where the read request message is used to request reading of first error data recorded by the processor. Although the computer has crashed and the processor cannot execute any computer instructions, the processor can receive and respond to the read request message.
[00109] The receiving unit is configured to receive a read response message returned by the processor, and obtain, according to the read response message, the first error data recorded by the processor. For example, when the read response message carries the first error data, the receiving unit can obtain the first error data from the read response message. As another example, when the read response message carries a read failure indication, which indicates that the first error data was not read from the processor, the receiving unit can instruct a warm reset unit or a computer user to perform a warm reset on the computer, so that the processor executes, during the warm reset, a defect collection instruction of a basic input-output system, acquires the first error data according to that instruction, and sends the first error data to the receiving unit; the receiving unit then receives the first error data sent by the processor. Optionally, after the first error data is acquired, the receiving unit can additionally send a clear data message to the processor, to instruct the processor to delete the first error data recorded by the processor, thereby avoiding a waste of a storage resource.
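The two cases handled by the receiving unit can be illustrated with a minimal Python sketch. The message layout (`ReadResponse`) and the two callbacks (`warm_reset`, `receive_after_reset`) are hypothetical names introduced only for this illustration; the specification does not define a message format or a programming interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReadResponse:
    """Hypothetical read response message (layout assumed for illustration)."""
    error_data: Optional[bytes] = None  # first error data, when the read succeeded
    read_failure: bool = False          # read failure indication


def obtain_first_error_data(
    response: ReadResponse,
    warm_reset: Callable[[], None],
    receive_after_reset: Callable[[], bytes],
) -> Optional[bytes]:
    """Return the first error data according to the read response message."""
    if response.read_failure:
        # The data could not be read from the processor: request a warm reset,
        # then receive the data that the processor collects during the reset.
        warm_reset()
        return receive_after_reset()
    # Otherwise the response itself carries the first error data.
    return response.error_data
```

In this sketch the warm reset is requested only on the failure path, which mirrors the order described above: a direct read is attempted first, and collection through the basic input-output system is the fallback.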
[00110] Optionally, the baseboard management controller can additionally include a determination unit, configured to receive an indication of a serious defect occurrence sent by the processor, where the indication of a serious defect occurrence is sent by the processor when the processor acquires the first error data and the first error data is of a serious uncorrectable error type; and if at least a portion of the first error data sent by the processor is not received within a predefined waiting time that starts from the time when the indication of a serious defect occurrence is received, determine that the computer is down.
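The timeout rule applied by the determination unit can be sketched as follows. This is only an illustration under stated assumptions: the function name `computer_is_down`, the use of plain numeric timestamps in seconds, and the inclusive window boundaries are choices made here, not details given in the specification.

```python
from typing import Iterable


def computer_is_down(
    indication_time: float,
    data_arrival_times: Iterable[float],
    waiting_time: float,
) -> bool:
    """Decide whether the computer is considered down.

    indication_time: when the serious defect occurrence indication was received.
    data_arrival_times: times at which portions of the first error data arrived.
    waiting_time: predefined waiting time, counted from indication_time.
    """
    deadline = indication_time + waiting_time
    # The computer is considered down if no portion of the first error data
    # arrived inside the waiting window.
    return not any(indication_time <= t <= deadline for t in data_arrival_times)
```

A portion of error data that arrives after the deadline does not prevent the determination, because the rule only considers what was received inside the window.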
[00111] Optionally, the baseboard management controller can additionally include a defect alarm unit, configured to: after the determination unit receives the indication of a serious defect occurrence sent by the processor, send an alarm message to a defect alarm module of the computer or perform a printing operation, to notify the user that a serious defect has occurred.
[00112] Optionally, the baseboard management controller can additionally include a defect processing unit, configured to parse the first error data according to a defect parsing mechanism, to obtain defect parsing information of the first error data. The defect parsing information of the first error data can include: the time when each piece of error data in the first error data was generated, who collected the error data, which processor and which core (Core) the error data comes from, which type of error the error data belongs to, and the like. The defect parsing information can not only be provided to a maintenance team or the user to understand the circumstances of a defect, but can also be used for subsequent location, analysis, and processing of the defect.
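As an illustration of what a defect parsing mechanism could produce, the following Python sketch decodes raw error data into records holding the fields listed above. The 8-byte record layout (`RECORD`), the class `DefectParsingInfo`, and the numeric encoding of each field are assumptions made for this example; the specification does not define the format of the error data.

```python
import struct
from dataclasses import dataclass
from typing import List

# Hypothetical layout: 4-byte timestamp, then one byte each for the collector,
# the processor, the core, and the error type (little-endian, 8 bytes total).
RECORD = struct.Struct("<IBBBB")


@dataclass
class DefectParsingInfo:
    timestamp: int   # time when the piece of error data was generated
    collector: int   # identifier of whoever collected the error data
    processor: int   # processor the error data comes from
    core: int        # core the error data comes from
    error_type: int  # type of error the error data belongs to


def parse_error_data(raw: bytes) -> List[DefectParsingInfo]:
    """Split raw error data into fixed-size records and decode each one."""
    if len(raw) % RECORD.size != 0:
        raise ValueError("error data length is not a multiple of the record size")
    return [DefectParsingInfo(*fields) for fields in RECORD.iter_unpack(raw)]
```

A fixed-size record is used here only because it keeps the example short; a real mechanism would follow whatever register or log format the processor actually exposes.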
[00113] In addition, the defect processing unit can be additionally configured to analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion. The predefined defect processing mechanism can be a defect mechanism or defect processing experience for X86. The defect processing suggestion can be defect location information or processing suggestion information, so that the user or a defect rectification team can perform processing on the computer according to the defect processing suggestion, to recover the computer.
[00114] When the defect processing unit analyzes only the defect parsing information of the first error data to obtain the defect processing suggestion, the first error data may cover only a very short period of time before the computer crashes. For example, the first error data may be error data generated within 0.8 seconds before the computer crashes. Therefore, to improve the accuracy of locating and analyzing a defect, the defect processing unit can analyze the defect parsing information of more error data. Specifically, the receiving unit is additionally configured to receive second error data sent by the processor, where the second error data is error data generated within a predefined time before the computer generates the first error data. The defect processing unit can parse the second error data according to the defect parsing mechanism, to obtain defect parsing information of the second error data, and analyze the defect parsing information of the second error data and the defect parsing information of the first error data according to the predefined defect processing mechanism, to obtain the defect processing suggestion.
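The benefit of analyzing both sets of defect parsing information together can be shown with a small Python sketch. The rule used here (suggest the processor and core that appear most often across all records) is a stand-in chosen for illustration and is not the predefined defect processing mechanism of the specification; the function name and the dictionary keys are likewise assumptions.

```python
from collections import Counter
from typing import Dict, List


def suggest_defect_location(
    first_info: List[Dict[str, int]],
    second_info: List[Dict[str, int]],
) -> str:
    """Combine both sets of defect parsing information and name the most
    frequently reported processor and core as the suspected location."""
    merged = list(second_info) + list(first_info)
    if not merged:
        raise ValueError("no defect parsing information to analyze")
    counts = Counter((rec["processor"], rec["core"]) for rec in merged)
    (processor, core), hits = counts.most_common(1)[0]
    return f"inspect processor {processor}, core {core} ({hits} of {len(merged)} records)"
```

With only the first error data, a single record gives no frequency information; adding the earlier records lets even this simple rule distinguish a recurring source from an isolated one.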
[00115] Optionally, the defect processing unit is additionally configured to print the defect parsing information from the first error data or the defect processing suggestion.
[00116] Optionally, the defect processing unit is additionally configured to save at least one of the defect parsing information of the first error data, the defect parsing information of the second error data, the first error data, and the second error data in a defect information base of the computer. For example, the defect parsing information of the first error data and the defect parsing information of the second error data are saved in the defect information base, or the first error data and the second error data are saved in the defect information base, so that a complete defect record is kept in the defect information base.
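One possible way to keep such a defect record is sketched below in Python, using an SQLite table as the defect information base. The table name `defect_info`, its columns, and the JSON encoding are assumptions made for this example only; the specification does not say how the defect information base is stored.

```python
import json
import sqlite3
from typing import Any, List


def save_defect_record(
    conn: sqlite3.Connection,
    first_info: List[Any],
    second_info: List[Any],
) -> int:
    """Store both sets of defect parsing information as one defect record
    and return the identifier of the new row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS defect_info ("
        "id INTEGER PRIMARY KEY, first_info TEXT, second_info TEXT)"
    )
    cur = conn.execute(
        "INSERT INTO defect_info (first_info, second_info) VALUES (?, ?)",
        (json.dumps(first_info), json.dumps(second_info)),
    )
    conn.commit()
    return cur.lastrowid
```

Storing the first and second parsing information in the same row keeps the complete record of one crash together, which is the purpose stated above.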
[00117] For details of the baseboard management controller in this embodiment of the present invention, reference can be made to the interaction between the baseboard management controller and the processor, and to the defect processing of the baseboard management controller, in Embodiment 1 or Embodiment 2 of the present invention.
[00118] In this embodiment of the present invention, when it is determined that the computer is down, the sending unit can send a read request message to a processor of the computer, where the read request message is used to request reading of first error data recorded by the processor, and the receiving unit can receive a read response message returned by the processor, and obtain, according to the read response message, the first error data recorded by the processor. With this embodiment of the present invention, an operating system does not need to be used: only a baseboard management controller is required to acquire error data on a computer after the computer is down. This resolves the prior-art problem that error data on a computer cannot be acquired after a serious uncorrectable error on the computer causes a system crash.
[00119] An embodiment of the present invention provides a computer-readable medium, including a computer-executable instruction, so that when a processor of a computer executes the computer-executable instruction, the computer performs the method for defect processing in Embodiment 3 or Embodiment 4.
[00120] Figure 6 shows a baseboard management controller provided in an embodiment of the present invention, where the baseboard management controller can include:
a processor 601, a memory 602, a system bus 604, and a communications interface 605, where processor 601, memory 602, and communications interface 605 are connected and communicate with each other using the system bus 604.
[00121] Processor 601 can be a single-core or multi-core central processing unit, or it can be an application-specific integrated circuit, or it can be configured as one or more integrated circuits that implement this embodiment of the present invention.
[00122] Memory 602 can be a high-speed RAM memory, or it can be a non-volatile memory, for example, at least one disk memory.
[00123] Memory 602 is configured to store a computer-executable instruction 603. Specifically, the computer-executable instruction 603 can include program code.
[00124] When the baseboard management controller runs, processor 601 executes the computer-executable instruction 603 to perform the method procedure for defect processing in Embodiment 3 or Embodiment 4.
[00125] A person of ordinary skill in the art can understand that each aspect of the present invention, or a possible implementation of each aspect, can be specifically implemented as a system, a method, or a computer program product. Therefore, each aspect of the present invention, or a possible implementation of each aspect, may take the form of a hardware-only embodiment, a software-only embodiment (including firmware, resident software, and the like), or an embodiment combining software and hardware, which are uniformly referred to as a circuit, module, or system in this document. In addition, each aspect of the present invention, or a possible implementation of each aspect, may take the form of a computer program product, where the computer program product refers to computer-readable program code stored in a computer-readable medium.
[00126] The computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium includes, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or apparatus, or any appropriate combination thereof, such as a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, and a compact disc read-only memory (CD-ROM).
[00127] A processor in a computer reads computer-readable program code stored in a computer-readable medium, so that the processor can perform a specified function and action at each step or a combination of steps in a flowchart; a device is generated to implement a function and action specified in each block or a combination of blocks in a block diagram.
[00128] All computer-readable program code can be run on a user's computer, or some can be run on a user's computer as a stand-alone software package, or some can be run on a user's computer while some is run on a remote computer, or all code can be run on a remote computer or a computer. It should also be noted that, in some alternative implementation solutions, the steps in the flowcharts or the functions specified in the blocks of the block diagrams may not occur in the order illustrated. For example, depending on the function involved, two consecutive steps or two blocks in the illustration can actually be performed at substantially the same time, or these blocks can sometimes be performed in reverse order.
[00129] A person of ordinary skill in the art may be aware that, in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A person skilled in the art can use different methods to implement the described functions for each particular application, but such an implementation should not be considered to be beyond the scope of the present invention.
[00130] The aforementioned descriptions are merely specific implementations of the present invention, but are not intended to limit the scope of protection of the present invention. Any variation or replacement readily perceived by a person skilled in the art within the technical scope disclosed in the present invention will be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention will be subject to the scope of protection of the claims.

Claims (26):
[1]
1. Computer, comprising a processor (11) and a baseboard management controller (12), CHARACTERIZED by the fact that the baseboard management controller (12) is configured to: when it is determined that the computer is down, send a read request message to the processor (11), where the read request message is used to request reading of first error data recorded by the processor (11);
the processor (11) is configured to receive the read request message, and send a read response message to the baseboard management controller (12); and the baseboard management controller (12) is configured to receive the read response message returned by the processor (11);
when the read response message carries a read failure indication, the baseboard management controller (12) is configured to instruct a warm reset module or a computer user to perform a warm reset on the computer, where the read failure indication is used to indicate that the first error data has not been read from the processor (11), so that the processor (11) executes, during the warm reset of the computer, a defect collection instruction of a basic input-output system of the computer, acquires the first error data according to the defect collection instruction of the basic input-output system, and sends the first error data to the baseboard management controller (12); and the baseboard management controller (12) is configured to receive the first error data sent by the processor (11).
[2]
2. Computer, according to claim 1, CHARACTERIZED by the fact that the processor (11) is additionally configured to acquire the first error data, and to record the first error data; and that the baseboard management controller (12) being configured to determine that the computer is down specifically comprises:
Petition 870180153319, of 11/21/2018, p. 8/16
the baseboard management controller (12) is configured to receive a defect occurrence indication sent by the processor (11), where the defect occurrence indication is sent by the processor (11) when the processor (11) acquires the first error data and the first error data is of an uncorrectable error type; and if at least a portion of the first error data sent by the processor (11) is not received within a predefined waiting time that starts from the time when the defect occurrence indication is received, the baseboard management controller (12) is configured to determine that the computer is down.
[3]
3. Computer, according to claim 1 or 2, CHARACTERIZED by the fact that when the read response message carries the first error data, the baseboard management controller (12) is configured to obtain, from the read response message, the first error data recorded by the processor (11).
[4]
4. Computer according to any of claims 1 to 2, CHARACTERIZED by the fact that the baseboard management controller (12) is additionally configured to send a clear data message to the processor (11), to instruct the processor (11) to delete the first error data recorded by the processor (11).
[5]
5. Computer, according to claim 2, CHARACTERIZED by the fact that the baseboard management controller (12) is additionally configured to: after the defect occurrence indication sent by the processor (11) is received, send an alarm message to a defect alarm module of the computer or perform a printing operation, to notify a user that a defect has occurred.
[6]
6. Computer according to any of claims 1 to 5, CHARACTERIZED by the fact that the baseboard management controller (12) is additionally configured to parse the first error data according to a defect parsing mechanism, to obtain defect parsing information of the first error data.
[7]
7. Computer, according to claim 6, CHARACTERIZED by the fact that the baseboard management controller (12) is additionally configured to analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion.
[8]
8. Computer, according to claim 7, CHARACTERIZED by the fact that before it is determined that the computer is down, the baseboard management controller (12) is additionally configured to receive second error data sent by the processor (11), and parse the second error data according to the defect parsing mechanism, to obtain defect parsing information of the second error data, where the second error data is error data generated within a predefined time before the computer generates the first error data; and the baseboard management controller (12) being configured to analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion, comprises:
the baseboard management controller (12) is configured to analyze the defect parsing information of the second error data and the defect parsing information of the first error data according to the predefined defect processing mechanism, to obtain the defect processing suggestion.
[9]
9. Computer according to any of claims 6 to 8, CHARACTERIZED by the fact that the baseboard management controller (12) is additionally configured to print the defect parsing information of the first error data or the defect processing suggestion.
[10]
10. Computer according to any one of claims 6 to 8, CHARACTERIZED by the fact that the baseboard management controller (12) is additionally configured to save at least one of the defect parsing information of the first error data, the defect parsing information of the second error data, the first error data, and the second error data in a defect information base of the computer.
[11]
11. Method for defect processing, applied to a computer comprising a baseboard management controller and a processor, and the method comprises:
when it is determined that the computer is down, sending, by the baseboard management controller, a read request message to the processor, CHARACTERIZED by the fact that the read request message is used to request reading of first error data recorded by the processor; and receiving, by the baseboard management controller, a read response message returned by the processor;
when the read response message carries a read failure indication, instructing, by the baseboard management controller, a warm reset module or a computer user to perform a warm reset on the computer, so that the processor executes, during the warm reset of the computer, a defect collection instruction of a basic input-output system of the computer, acquires the first error data according to the defect collection instruction of the basic input-output system, and sends the first error data to the baseboard management controller, where the read failure indication is used to indicate that the first error data was not read from the processor; and receiving, by the baseboard management controller, the first error data sent by the processor.
[12]
12. Method, according to claim 11, CHARACTERIZED by the fact that the method additionally comprises:
receiving, by the baseboard management controller, a defect occurrence indication sent by the processor, where the defect occurrence indication is sent by the processor when the processor acquires the first error data and the first error data is of an uncorrectable error type; and if at least a portion of the first error data sent by the processor is not received within a predefined waiting time that starts from the time when the defect occurrence indication is received, determining that the computer is down.
[13]
13. Method, according to claim 11 or 12, CHARACTERIZED by the fact that:
when the read response message carries the first error data, obtaining, by the baseboard management controller, from the read response message, the first error data recorded by the processor.
[14]
14. Method, according to any one of claims 13 to 14, CHARACTERIZED by the fact that the method additionally comprises: parsing, by the baseboard management controller, the first error data according to a defect parsing mechanism, to obtain defect parsing information of the first error data.
[15]
15. Method, according to claim 14, CHARACTERIZED by the fact that the method additionally comprises: analyzing, by the baseboard management controller, the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion.
[16]
16. Method, according to claim 15, CHARACTERIZED by the fact that before it is determined, by the baseboard management controller, that the computer is down, the method additionally comprises: receiving, by the baseboard management controller, second error data sent by the processor, where the second error data is error data generated within a predefined time before the computer generates the first error data; and the analyzing, by the baseboard management controller, of the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion, comprises:
parsing, by the baseboard management controller, the second error data according to the defect parsing mechanism, to obtain defect parsing information of the second error data, and analyzing the defect parsing information of the second error data and the defect parsing information of the first error data according to the predefined defect processing mechanism, to obtain the defect processing suggestion.
[17]
17. Method according to any of claims 14 to 16, CHARACTERIZED by the fact that the method additionally comprises: printing, by the baseboard management controller, the defect parsing information of the first error data or the defect processing suggestion.
[18]
18. Method according to any one of claims 14 to 16, CHARACTERIZED by the fact that the method additionally comprises: saving, by the baseboard management controller, at least one of the defect parsing information of the first error data, the defect parsing information of the second error data, the first error data, and the second error data in a defect information base of the computer.
[19]
19. Baseboard management controller (50), CHARACTERIZED by the fact that it comprises:
a sending unit (501), configured to: when it is determined that a computer is down, send a read request message to a processor, where the read request message is used to request reading of first error data recorded by the processor; and a receiving unit (502), configured to receive a read response message returned by the processor, where when the read response message carries a read failure indication, the receiving unit (502) instructs a warm reset unit or a computer user to perform a warm reset on the computer, so that the processor executes, during the warm reset of the computer, a defect collection instruction of a basic input-output system of the computer, acquires the first error data according to the defect collection instruction of the basic input-output system, and sends the first error data to the receiving unit (502), where the read failure indication is used to indicate that the first error data was not read from the processor; and the receiving unit (502) receives the first error data sent by the processor.
[20]
20. Baseboard management controller (50), according to claim 19, CHARACTERIZED by the fact that it additionally comprises:
a determination unit (503), configured to receive a defect occurrence indication sent by the processor, where the defect occurrence indication is sent by the processor when the processor acquires the first error data and the first error data is of an uncorrectable error type; and if at least a portion of the first error data sent by the processor is not received within a predefined waiting time that starts from the time when the defect occurrence indication is received, determine that the computer is down.
[21]
21. Baseboard management controller (50) according to claim 19 or 20, CHARACTERIZED by the fact that when the read response message carries the first error data, the receiving unit (502) obtains, from the read response message, the first error data recorded by the processor.
[22]
22. Baseboard management controller (50), according to any one of claims 19 to 23, CHARACTERIZED by the fact that it additionally comprises:
a defect processing unit (505), configured to parse the first error data according to a defect parsing mechanism, to obtain defect parsing information from the first error data.
[23]
23. Baseboard management controller (50) according to claim 22, CHARACTERIZED by the fact that the defect processing unit (505) is additionally configured to analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion.
[24]
24. Baseboard management controller (50), according to claim 23, CHARACTERIZED by the fact that the receiving unit (502) is additionally configured to receive second error data sent by the processor;
the defect processing unit (505) is additionally configured to parse the second error data according to the defect parsing mechanism, to obtain defect parsing information of the second error data, where the second error data is error data generated within a predefined time before the computer generates the first error data; and the defect processing unit (505) being configured to analyze the defect parsing information of the first error data according to a predefined defect processing mechanism, to obtain a defect processing suggestion, comprises:
the defect processing unit (505) analyzes the defect parsing information of the second error data and the defect parsing information of the first error data according to the predefined defect processing mechanism, to obtain the suggestion defect processing.
[25]
25. Baseboard management controller (50) according to any of claims 22 to 24, CHARACTERIZED by the fact that the defect processing unit (505) is additionally configured to save at least one of the defect parsing information of the first error data, the defect parsing information of the second error data, the first error data, and the second error data in a defect information base of the computer.
[26]
26. Baseboard management controller, CHARACTERIZED by the fact that the baseboard management controller comprises a processor (601), a memory (602), a system bus (604), and a communications interface (605), where the memory (602) is configured to store a computer-executable instruction, the processor (601) is connected to the memory (602) using the system bus (604), and when the baseboard management controller runs, the processor (601) executes the computer-executable instruction stored in the memory (602), so that the baseboard management controller performs the method for defect processing as defined in any of claims 11 to 18.

Patent family:

Publication number | Publication date

AU2014399227B2|2017-07-27|

JP6333410B2|2018-05-30|

WO2015196365A1|2015-12-30|

CA2942045A1|2015-12-30|

EP3355197A1|2018-08-01|

EP3121726A1|2017-01-25|

EP3121726B1|2018-01-31|

US20190332453A1|2019-10-31|

ZA201606180B|2019-04-24|

SG11201607545PA|2016-10-28|

US20170102985A1|2017-04-13|

EP3121726A4|2017-05-03|

ES2667322T3|2018-05-10|

KR20160128404A|2016-11-07|

AU2014399227A1|2016-09-22|

CN105659215B|2017-08-25|

US20210182136A1|2021-06-17|

JP2017517060A|2017-06-22|

DK3121726T3|2018-05-22|

CN107357671A|2017-11-17|

US10353763B2|2019-07-16|

KR101944874B1|2019-02-01|

NO3121726T3|2018-06-30|

CN105659215A|2016-06-08|

CA2942045C|2019-04-16|

EP3355197B1|2019-10-23|

引用文献:

公开号 | 申请日 | 公开日 | 申请人 | 专利标题

JP3902564B2|2003-04-15|2007-04-11|中部日本電気ソフトウェア株式会社|Fault reporting device and fault reporting method|

JP2005251060A|2004-03-08|2005-09-15|Hitachi Ltd|Failure indication device, and failed portion indication method|

US7409594B2|2004-07-06|2008-08-05|Intel Corporation|System and method to detect errors and predict potential failures|

US7546487B2|2005-09-15|2009-06-09|Intel Corporation|OS and firmware coordinated error handling using transparent firmware intercept and firmware services|

US20070088988A1|2005-10-14|2007-04-19|Dell Products L.P.|System and method for logging recoverable errors|

US20070234123A1|2006-03-31|2007-10-04|Inventec Corporation|Method for detecting switching failure|

US7594144B2|2006-08-14|2009-09-22|International Business Machines Corporation|Handling fatal computer hardware errors|

US8024609B2|2009-06-03|2011-09-20|International Business Machines Corporation|Failure analysis based on time-varying failure rates|

JP5514643B2|2010-06-21|2014-06-04|株式会社日立ソリューションズ|Failure cause determination rule change detection device and program|

CN102375775B|2010-08-11|2014-08-20|英业达股份有限公司|Computer system unrecoverable error indication signal detection circuit|

JP5541519B2|2010-10-06|2014-07-09|エヌイーシーコンピュータテクノ株式会社|Information processing apparatus, failure part determination method, and failure part determination program|

CN102467440A|2010-11-09|2012-05-23|鸿富锦精密工业（深圳）有限公司|Internal memory error detection system and method|

US8898408B2|2011-12-12|2014-11-25|Dell Products L.P.|Memory controller-independent memory mirroring|

EP2859459B1|2012-06-06|2019-12-25|Intel Corporation|Recovery after input/ouput error-containment events|

CN103514068A|2012-06-28|2014-01-15|北京百度网讯科技有限公司|Method for automatically locating internal storage faults|

JP6087540B2|2012-08-30|2017-03-01|Ｎｅｃプラットフォームズ株式会社|Fault trace apparatus, fault trace system, fault trace method, and fault trace program|

CN103647804B|2013-11-22|2017-04-26|华为技术有限公司|Method for data processing of storage unit, device and system|

CN105975377B|2016-04-29|2018-05-25|浪潮电子信息产业股份有限公司|A kind of method and device for monitoring memory|

CN107077408A|2016-12-05|2017-08-18|华为技术有限公司|Method, computer system, baseboard management controller and the system of troubleshooting|

JP2018160009A|2017-03-22|2018-10-11|Ｎｅｃプラットフォームズ株式会社|Failure information processing program, computer, failure notification method, and computer system|

CN108108259A|2018-01-11|2018-06-01|郑州云海信息技术有限公司|A kind of kernel Fault Locating Method and device|

CN108958965B|2018-06-28|2021-03-02|苏州浪潮智能科技有限公司|Method, device and equipment for monitoring recoverable ECC errors by BMC|

CN109240847A|2018-09-27|2019-01-18|郑州云海信息技术有限公司|EMS memory error report method, device, terminal and storage medium during a kind of POST|

US10846162B2|2018-11-29|2020-11-24|Oracle International Corporation|Secure forking of error telemetry data to independent processing units|

CN112346786A|2019-08-08|2021-02-09|佛山市顺德区顺达电脑厂有限公司|Debugging information recording method applied to startup stage and operation stage after startup|

US11243859B2|2019-10-09|2022-02-08|Microsoft Technology Licensing, Llc|Baseboard management controller that initiates a diagnostic operation to collect host information|

US11132314B2|2020-02-24|2021-09-28|Dell Products L.P.|System and method to reduce host interrupts for non-critical errors|

CN113535502A|2020-04-17|2021-10-22|捷普科技（上海）有限公司|Error log collecting method for server system|

US11204821B1|2020-05-07|2021-12-21|Xilinx, Inc.|Error re-logging in electronic systems|

CN112256467A|2020-10-23|2021-01-22|英业达科技有限公司|Error type judging system and method thereof|

US11269729B1|2020-12-21|2022-03-08|Microsoft Technology Licensing, Llc|Overloading a boot error signaling mechanism to enable error mitigation actions to be performed|

Legal status:
2018-09-04| B06A| Notification to applicant to reply to the report for non-patentability or inadequacy of the application [chapter 6.1 patent gazette]|

2018-12-04| B09A| Decision: intention to grant [chapter 9.1 patent gazette]|

2019-01-02| B16A| Patent or certificate of addition of invention granted|Free format text: TERM OF VALIDITY: 20 (TWENTY) YEARS COUNTED FROM 24/06/2014, SUBJECT TO THE LEGAL CONDITIONS. |

Priority:

Application number | Filing date | Patent title

PCT/CN2014/080618|WO2015196365A1|2014-06-24|2014-06-24|Fault processing method, related device and computer|
